Classification and Regression Trees
Readings: ISLR Section 8.1, Code
Bullet items marked with ^{\dagger} are not in the reading and will not be tested.
Setup
Goal. Given data \{\left({x_i, y_i}\right)\}_{i = 1}^{N}, develop a predictor for y_{i} that handles both regression (when y_i is continuous) and classification (when y_i is categorical).
Requirements.
- Tree structure. The decision function \hat{f}\left(x\right) must be representable as a tree. Each prediction can be traced through a sequence of decision nodes. This supports interpretability.
- Flexible inputs. The features x_{i} may be continuous or categorical types.
- Feature interactions. The model must have the flexibility to learn feature interactions. That is, \hat{f}\left(x\right) can represent relationships where the effect of feature x_{j} on y depends on the value of x_{k} (e.g., y_{i} is in class A if and only if both x_{i1} and x_{i2} are positive).
Approach.
- Partition representation. Observe a correspondence between the L leaves of a tree diagram and size L partitions of the input space \mathcal{X} = R_{1} \cup R_{2} \cup \dots \cup R_{L}.
- Prediction rule. For a test point x^* \in R_{l}, predict using the training samples in R_{l}: average y_{i} for regression or majority class for classification.
- Tree construction. Grow the tree recursively using a greedy algorithm that optimizes locally for good prediction performance. Prune the tree using a validation set to ensure generalization ability.
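The prediction rule above is simple enough to state in a few lines of code. Here is a minimal Python sketch (the course's code examples use R; the function name and interface here are ours, not from the reading): given the training responses that fall in the same region R_{l} as the test point, it returns their average for regression or the majority class for classification.

```python
import numpy as np

def predict_in_region(y_region, task="regression"):
    """Predict for a test point that falls in region R_l, using only
    the training responses y_region from that region."""
    y_region = np.asarray(y_region)
    if task == "regression":
        return y_region.mean()  # average response in the region
    # classification: majority class among training samples in the region
    values, counts = np.unique(y_region, return_counts=True)
    return values[np.argmax(counts)]
```

Locating which region a test point belongs to is exactly a root-to-leaf traversal of the tree, discussed next.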
Trees \iff Partitions
For now, let’s consider x = \left(x_{1}, \dots, x_{J}\right) with only continuous features. Each node in a decision tree tests whether a feature x_{j} \geq s or x_{j} < s for some threshold s. We proceed down the right branch if and only if x_j \geq s.
A leaf node is a node without any descendants. Following the path from the root to a leaf, we impose a series of constraints on x corresponding to axis-aligned rectangles.
Since each x follows exactly one root-to-leaf path, the L leaves define L nonoverlapping regions \left(R_{l}\right)_{l = 1}^{L} that cover the entire input space: \mathcal{X} = R_{1} \cup R_{2} \cup \dots \cup R_{L}.
Exercise: What tree does this partition correspond to? Make sure to annotate the regions R_{l} and split rules.
Growing Algorithm
1D case. How should we learn the partition from data \{\left({x_i, y_i}\right)\}_{i = 1}^{N}? First suppose we have only one feature and are only allowed one split. The resulting partition has only two elements R^{\text{left}} and R^{\text{right}}, divided at a threshold s. We choose the s that minimizes \begin{align*} \sum_{i: x_i \in R^{\text{left}}(s)} (y_i - \bar{y}_{R^{\text{left}}})^2 + \sum_{i: x_i \in R^{\text{right}}(s)} (y_i - \bar{y}_{R^{\text{right}}})^2 \end{align*}
where \bar{y}_{R} = \frac{1}{|R|}\sum_{i: x_i \in R} y_i is the mean training response in region R.
Exercise: The formula above has squared error terms like \left(y_i - \bar{y}_{R^{\text{left}}}\right)^2. How would you modify the figure to represent these quantities?
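The one-dimensional split search can be sketched directly. Below is a minimal Python version (the course's code uses R; names here are ours): it scans the midpoints between consecutive sorted x values and returns the threshold s minimizing the sum of the left and right RSS terms above.

```python
import numpy as np

def best_split_1d(x, y):
    """Return the threshold s minimizing left RSS + right RSS,
    scanning midpoints between consecutive sorted x values."""
    order = np.argsort(x)
    x, y = np.asarray(x, dtype=float)[order], np.asarray(y, dtype=float)[order]
    best_s, best_rss = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no valid threshold between tied x values
        s = (x[i - 1] + x[i]) / 2
        left, right = y[:i], y[i:]
        rss = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss
```

For a step-shaped response the search recovers the step location exactly, since splitting there drives both within-region RSS terms to zero.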
Inductive step. Suppose we have already made L splits, so we have L + 1 disjoint intervals R_{1}, \dots, R_{L + 1} with current total residual sum of squares (RSS): \sum_{l=1}^{L +1} \sum_{i: x_i \in R_l} (y_i - \bar{y}_{R_l})^2 To make the next split, consider the RSS reduction from splitting region l at threshold s: \Delta(l, s) = \sum_{i: x_i \in R_l} (y_i - \bar{y}_{R_l})^2 - \left[\sum_{i: x_i \in R_l^\text{left}(s)} (y_i - \bar{y}_{R_l^\text{left}})^2 + \sum_{i: x_i \in R_l^\text{right}(s)} (y_i - \bar{y}_{R_l^\text{right}})^2\right] \tag{1} Here, R_l^\text{left}(s) and R_l^\text{right}(s) denote the left and right intervals made when splitting R_l at s. We can test all regions l \in \{1, \dots, L + 1\} and a fine grid of s within each R_{l}. The best split is: \left(l^*, s^*\right) := \arg \max_{l, s} \Delta\left(l, s\right)
We stop growing the tree when we’ve reached a stopping criterion; e.g., when the best RSS improvement is too small or the leaves have too few samples to split further.
A similar approach works in higher dimensions – we just need to also optimize over which feature to split on in each inductive step.
Specifically, consider splitting leaf l along feature j at threshold s. The associated RSS reduction is: \Delta(l, j, s) = \sum_{i: x_i \in R_l} (y_i - \bar{y}_{R_l})^2 - \left[\sum_{i: x_i \in R_l^{\text{left}}(j,s)} (y_i - \bar{y}_{R_l^{\text{left}}})^2 + \sum_{i: x_i \in R_l^{\text{right}}(j,s)} (y_i - \bar{y}_{R_l^{\text{right}}})^2\right] \tag{2} where R_l^{\text{left}}(j,s) = \{x \in R_l : x_j < s\} and R_l^{\text{right}}(j,s) = \{x \in R_l : x_j \geq s\}, matching the convention that we branch right when x_j \geq s. We search over all candidate regions l, features j, and thresholds s to find: (l^*, j^*, s^*) := \arg\max_{l, j, s} \Delta(l, j, s)
Exercise: Compare and contrast the formulas in Equation 1 and Equation 2 .
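The higher-dimensional search can be sketched as follows, again in Python with hypothetical names (the course code is in R). The function takes the current leaves as (X, y) pairs and returns the triple (l^*, j^*, s^*) maximizing the RSS reduction, using the observed values of each feature as candidate thresholds rather than a fixed grid.

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the region mean."""
    y = np.asarray(y, dtype=float)
    return ((y - y.mean())**2).sum() if len(y) else 0.0

def best_split(regions):
    """regions: list of (X, y) pairs, one per current leaf R_l.
    Return ((l*, j*, s*), gain) maximizing Delta(l, j, s)."""
    best, best_gain = (None, None, None), 0.0
    for l, (X, y) in enumerate(regions):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        parent = rss(y)
        for j in range(X.shape[1]):
            # candidate thresholds: all but the largest observed value of x_j
            for s in np.unique(X[:, j])[:-1]:
                left = X[:, j] <= s
                gain = parent - rss(y[left]) - rss(y[~left])
                if gain > best_gain:
                    best, best_gain = (l, j, s), gain
    return best, best_gain
```

In a full implementation this search is repeated after each split, with the winning leaf replaced by its two children.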
Classification
So far we’ve focused on regression. For classification, we still split recursively but can’t use \Delta\left(l, j, s\right) as our objective.
Our first thought might be to use classification error rate instead of RSS. This is not a good idea, because we might want to split a region R_{l} to improve “node purity,” even if it doesn’t improve the overall training error rate. For example, the split below is still worth making, even though the overall error rate is 40% in both cases. This is because the error rate in R_3 is now \approx 27\% and is likely to generalize well.
R_init (100 points): [●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ○○○○○○○○○○○○○○○○○○○○] 60:40 ratio

After the split:

R_2 (70 pts): [●●●●●●●●●●●●●●●●●●● ○○○○○○○○○○○○○○○○] 38:32
R_3 (30 pts): [●●●●●●●●●●● ○○○○] 22:8

To measure node purity, we can use either the Gini index or the entropy. If y belongs to one of K classes and \hat{p}_{lk} denotes the proportion of class-k samples in region R_{l}, these are defined as,
\begin{align*} \text{Gini}\left(R_{l}\right) &= \sum_{k= 1}^{K} \hat{p}_{lk}\left(1 - \hat{p}_{lk}\right)\\ \text{Entropy}\left(R_{l}\right) &= -\sum_{k= 1}^{K} \hat{p}_{lk}\log\left(\hat{p}_{lk}\right) \end{align*}
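Both impurity measures are one-liners. Here is a small Python sketch (names ours) operating on a vector of class proportions \hat{p}_{l1}, \dots, \hat{p}_{lK}; the entropy treats empty classes as contributing zero, the standard 0 \log 0 = 0 convention.

```python
import numpy as np

def gini(p):
    """Gini index of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return float((p * (1 - p)).sum())

def entropy(p):
    """Entropy of a vector of class proportions; 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())
```

Both are zero for a pure node (all mass on one class) and maximized when the classes are evenly mixed.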
Geometrically, these two functions have similar forms. The corners of the simplex below correspond to cases where most samples come from one of the K = 3 classes. A good classification tree will have learned regions R_{l} whose class distributions are close to one of the K corners of the simplex (similarly to how a good regression tree learns regions R_{l} with low RSS).
Exercise: What is the analogous figure when there are K = 2 classes?
Given this new objective, we can define the tree growing induction as before: \Delta(l, j, s) = F\left(R_l\right) - \left[F\left(R_{l}^{\text{left}}(j, s)\right) + F\left(R_{l}^{\text{right}}(j, s)\right) \right] where F denotes either the Gini index or the entropy. As before, we search over candidate leaves, features, and thresholds to select the best possible split, (l^*, j^*, s^*) := \arg\max_{l, j, s} \Delta(l, j, s)
Categorical Predictors
So far we’ve only considered continuous features x \in \mathbb{R}^{J}. For a categorical feature x_j with Q levels, we split by choosing a subset A \subset \{1, \dots, Q\} and splitting on whether x_{j} \in A.
The RSS reduction for a categorical split becomes,
\Delta(l, j, A) = \sum_{i: x_i \in R_l} (y_i - \bar{y}_{R_l})^2 - \left[\sum_{i: x_i \in R_l, \, x_{ij} \in A} (y_i - \bar{y}_{A})^2 + \sum_{i: x_i \in R_l, \, x_{ij} \in A^C} (y_i - \bar{y}_{A^C})^2\right] where \bar{y}_{A} and \bar{y}_{A^C} denote the mean responses among training points in R_l with x_{ij} \in A and x_{ij} \in A^{C}, respectively. We search over all candidate regions l, features j, and subsets A to find the choice that maximizes the RSS reduction.
^{\dagger} When Q is large, there are 2^{Q - 1} - 1 possible ways to split the levels into A and A^{C}. For regression, we can reduce this to just Q - 1 splits.
- For each level q \in \{1, \ldots, Q\}, compute the average y_i among training points in R_l with that level: \bar{y}_{q} = \frac{1}{|\{i : x_i \in R_l, \, x_{ij} = q\}|} \sum_{i : x_i \in R_l, \, x_{ij} = q} y_{i}
- Sort the levels by these averages: \bar{y}_{q_1} \leq \bar{y}_{q_2} \leq \cdots \leq \bar{y}_{q_Q}
- Only consider splits that respect this ordering: A = \{q_1, \ldots, q_s\} for s = 1, \ldots, Q-1
Intuitively, levels with similar average responses are grouped together in candidate splits.
Exercise: In the figure above, why don’t we consider A = \{1\} as a candidate split?
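The ordering trick can be sketched in a few lines of Python (the course code is R; names here are ours). We assume every level appears at least once in the region, so each level mean is well defined.

```python
import numpy as np

def candidate_subsets(levels, x_cat, y):
    """Order levels by mean response, then return the Q - 1
    order-respecting candidate subsets A = {q_1, ..., q_s}."""
    # mean response per level (assumes each level occurs in the region)
    means = {q: np.mean([yi for xi, yi in zip(x_cat, y) if xi == q])
             for q in levels}
    ordered = sorted(levels, key=lambda q: means[q])
    # only splits that respect the ordering of level means
    return [set(ordered[:s]) for s in range(1, len(ordered))]
```

With three levels this yields just two candidate subsets instead of the three arbitrary ones, and the savings grow quickly with Q.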
Similar ordering strategies are possible for classification trees, but they are beyond the scope of this class.
Cost Complexity Pruning
The tree growing algorithm may overfit, producing large trees with poor test performance and interpretability.
We could consider requiring larger improvements in \Delta\left(l, j, s\right) before making a split. However, this is not a good idea – splits that don’t seem promising at first might enable valuable splits later, especially when features interact. For example, the initial split for x_{1} below only gives a modest \Delta, since there is a mix of high and low y values. After the split on x_{2}, though, the prediction is nearly perfect. A conservative rule that rejects the first split would miss this signal entirely.
Exercise: Recall the definition of a feature interaction given at the start of these notes. How does this figure give an example of such an interaction?
Let T_0 be the full tree. For penalty \alpha \geq 0, we find a subtree T \subset T_0 (formed by collapsing internal nodes) that minimizes: \mathcal{L}_{\alpha}(T) = \sum_{l=1}^{|T|} \sum_{i: x_i \in R_l} (y_i - \bar{y}_{R_l})^2 + \alpha |T| where |T| is the number of leaves in T. The first term is the training error, while \alpha \left|T\right| penalizes tree complexity. When \alpha = 0, we keep the original tree. As \alpha increases, the optimal subtree gets smaller. We tune \alpha with cross-validation.
^{\dagger} For an internal node t in the tree, define,
- T_t: The subtree rooted at t. I.e., the tree with root t and all its descendants.
- \text{RSS}\left(t\right): The RSS if we collapse T_t into a single leaf t, \begin{align*} \text{RSS}\left(t\right) &= \sum_{i: x_i \in R_t} \left(y_i - \bar{y}_{R_t}\right)^{2} \end{align*} where R_t is the region of training samples that reach node t and \bar{y}_{R_{t}} is their average response.
- \text{RSS}\left(T_t\right): The RSS when keeping all leaves in T_t, \begin{align*} \text{RSS}\left(T_t\right) &= \sum_{l \in \text{leaves}(T_t)}\sum_{i: x_i \in R_{l}} \left(y_i - \bar{y}_{R_{l}}\right)^{2}. \end{align*}
Exercise: For any given t, is RSS\left(t\right) \geq RSS\left(T_t\right), RSS\left(t\right) < RSS\left(T_t\right), or “it depends on t”?
^{\dagger} Keeping T_t vs. collapsing it improves the RSS by \text{RSS}(t) - \text{RSS}(T_t) at a cost of |T_t| - 1 extra leaves. The cost-benefit ratio is: \alpha_t := \frac{\text{RSS}(t) - \text{RSS}(T_t)}{|T_t| - 1} This is the RSS reduction per additional leaf. Nodes with small \alpha_t provide little benefit relative to their complexity.
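The ratio itself is a one-line computation; here is a Python sketch (names ours) taking the two RSS values and the leaf count of T_t directly as inputs.

```python
def alpha_t(rss_collapsed, rss_subtree, n_leaves):
    """Cost-benefit ratio alpha_t = (RSS(t) - RSS(T_t)) / (|T_t| - 1):
    the RSS reduction per extra leaf kept by not collapsing node t."""
    return (rss_collapsed - rss_subtree) / (n_leaves - 1)
```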
^{\dagger} This suggests the weakest-link pruning algorithm.
- Compute \alpha_t for each internal node t in the current tree
- Collapse the subtree T_t with the smallest \alpha_t (the “weakest link”)
- Repeat until only the root remains
This yields a sequence of subtrees T_0 \supset T_1 \supset \cdots \supset T_M. The m-th pruning step removes a node with cost-benefit ratio \alpha_{(m)}, giving thresholds 0 = \alpha_{(0)} < \alpha_{(1)} < \cdots < \alpha_{(M)}. For any penalty \alpha \in [\alpha_{(m)}, \alpha_{(m+1)}), the optimal subtree minimizing \mathcal{L}_{\alpha}(T) is T_m.
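To make the last point concrete, here is a small Python sketch (names ours) mapping a penalty \alpha to the index m of the optimal subtree T_m, given the sorted thresholds 0 = \alpha_{(0)} < \alpha_{(1)} < \cdots < \alpha_{(M)}.

```python
import bisect

def subtree_index(alphas, alpha):
    """Given sorted pruning thresholds alphas = [alpha_(0), ..., alpha_(M)]
    with alpha_(0) = 0, return the index m with
    alpha in [alpha_(m), alpha_(m+1))."""
    return bisect.bisect_right(alphas, alpha) - 1
```

So instead of re-solving the pruning problem per \alpha, we prune once to get the nested sequence and then look up the right subtree for each candidate \alpha during cross-validation.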
Code Example
We’ll use rpart to fit regression trees to the 2D dataset below, which has strong interactions between x_1 and x_2 and isn’t a simple piecewise constant surface.

```r
library(rpart)
library(rpart.plot)
library(tibble)

sim_data <- tibble(
  x1 = runif(500),
  x2 = runif(500),
  y = 10 * sin(pi * x1 * x2) + 20 * (x2 - 0.5)^2 + rnorm(500)
)
```

We fit a regression tree with the default complexity penalty. Each leaf region R_{l} predicts \bar{y}_{R_{l}}.
```r
fit <- rpart(y ~ x1 + x2, data = sim_data)
rpart.plot(fit)
```

This is the associated partition of the feature space \mathcal{X} = \left[0, 1\right]^{2}. Each leaf in the tree is one axis-aligned rectangle.
```r
# Evaluate the fitted tree on an evaluation grid over [0, 1]^2
# (the grid construction here is an assumed reconstruction)
grid <- expand.grid(
  x1 = seq(0, 1, length.out = 100),
  x2 = seq(0, 1, length.out = 100)
)
grid$pred <- predict(fit, newdata = grid)
```
Let’s study the influence of \alpha. The cp hyperparameter in rpart is \alpha normalized by the root node’s RSS; the normalization makes cp comparable across datasets. We can see that when cp is small, we are left with a large tree (fine partition), while when cp is large, we have a simpler tree (coarse partition).

```r
par(mfrow = c(1, 3), mar = c(2, 2, 3, 1))
for (cp_val in c(0.1, 0.01, 0.001)) {
  rpart.plot(
    rpart(y ~ x1 + x2, data = sim_data, cp = cp_val),
    main = paste("cp =", cp_val)
  )
}
```

fit$frame and fit$where help inspect individual nodes. fit$frame gives node statistics: sample counts, average values \bar{y}_{R_{l}}, and pruning thresholds \alpha_{m}.

```r
fit$frame
```

```
      var   n  wt        dev      yval  complexity ncompete nsurrogate
1      x1 500 500 5748.81339  6.907771 0.320308139        1          1
2      x2 176 176 1357.90713  4.303991 0.133486128        1          0
4      x1 139 139  311.38191  3.226672 0.013370287        1          1
8  <leaf>  55  55   92.62064  2.307683 0.004669844        0          0
9  <leaf>  84  84  141.89799  3.828391 0.006085289        0          0
5      x1  37  37  279.13837  8.351216 0.030283514        1          0
10 <leaf>  10  10   21.29115  4.786924 0.010000000        0          0
11 <leaf>  27  27   83.75296  9.671324 0.004558962        0          0
3      x2 324 324 2549.51454  8.322170 0.224585945        1          1
6      x2 178 178  573.88705  6.514270 0.029379612        1          1
12 <leaf>  83  83  131.95147  5.472133 0.006684417        0          0
13     x1  95  95  273.03767  7.424769 0.022796693        1          1
26 <leaf>  52  52   87.78750  6.356709 0.004235767        0          0
27 <leaf>  43  43   54.19624  8.716377 0.010000000        0          0
7      x1 146 146  684.52480 10.526321 0.033186487        1          1
14 <leaf>  50  50  100.09881  9.180712 0.006617028        0          0
15     x2  96  96  446.74007 11.227159 0.033186487        1          1
30 <leaf>  56  56  135.34135  9.880095 0.006896626        0          0
31 <leaf>  40  40   67.51881 13.113050 0.003096152        0          0
```

fit$where tells which leaf each point belongs to; i.e., it groups similar points from the tree’s point of view. The vector names are the data.frame row numbers from the frame object above.

```r
table(fit$where)
```

```
 4  5  7  8 11 13 14 16 18 19
55 84 10 27 83 52 43 50 56 40
```